Safety certification of data-driven control techniques remains a major open problem. This work investigates backward reachability as a framework for providing collision avoidance guarantees for systems controlled by neural network (NN) policies. Because NNs are typically not invertible, existing methods conservatively assume a domain over which to relax the NN, which causes loose over-approximations of the set of states that could lead the system into the obstacle (i.e., backprojection (BP) sets). To address this issue, we introduce DRIP, an algorithm with a refinement loop on the relaxation domain, which substantially tightens the BP set bounds. Furthermore, we introduce a formulation that directly yields closed-form polytope representations that bound the BP sets more tightly than prior work, which required solving linear programs and using hyper-rectangles. Finally, this work extends the NN relaxation algorithm to handle polytope domains, which further tightens the bounds on BP sets. DRIP is demonstrated in numerical experiments on control systems, including a ground robot controlled by a learned NN obstacle avoidance policy.
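As a rough illustration of the refinement idea described above, the following is a minimal sketch of a relaxation-domain refinement loop for a single-step backprojection set. It assumes linear dynamics, axis-aligned interval domains, and a `relax_policy` oracle (e.g., a CROWN-style bounder) that returns interval bounds on the NN policy output over a given domain; it is not the authors' DRIP implementation.

```python
# A minimal sketch of a relaxation-domain refinement loop in the spirit of DRIP
# (illustrative only). Assumptions: single-step linear dynamics x' = A x + B u,
# interval (hyper-rectangle) domains, invertible A, and a user-supplied
# `relax_policy(lo, hi) -> (u_lo, u_hi)` oracle bounding the NN policy output.
import numpy as np

def interval_matvec(M, lo, hi):
    """Sound interval bounds on M @ x for x in the box [lo, hi]."""
    M_pos, M_neg = np.maximum(M, 0.0), np.minimum(M, 0.0)
    return M_pos @ lo + M_neg @ hi, M_pos @ hi + M_neg @ lo

def refine_backprojection(A, B, target_lo, target_hi, domain_lo, domain_hi,
                          relax_policy, num_iters=10):
    """Iteratively tighten an over-approximation of the backprojection set:
    states x in the domain for which u = pi(x) can drive A x + B u into the
    target (obstacle) set."""
    lo, hi = domain_lo.copy(), domain_hi.copy()
    A_inv = np.linalg.inv(A)
    for _ in range(num_iters):
        # Bound the policy output over the current relaxation domain.
        u_lo, u_hi = relax_policy(lo, hi)
        Bu_lo, Bu_hi = interval_matvec(B, u_lo, u_hi)
        # Necessary condition: A x must lie in target shrunk by the B u bounds.
        Ax_lo, Ax_hi = target_lo - Bu_hi, target_hi - Bu_lo
        # Pull back through A with interval arithmetic, then intersect.
        cand_lo, cand_hi = interval_matvec(A_inv, Ax_lo, Ax_hi)
        new_lo, new_hi = np.maximum(lo, cand_lo), np.minimum(hi, cand_hi)
        if np.allclose(new_lo, lo) and np.allclose(new_hi, hi):
            break  # the relaxation domain stopped shrinking
        lo, hi = new_lo, new_hi
    return lo, hi
```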
Microprocessor architects are increasingly resorting to domain-specific customization in the quest for high performance and energy efficiency. As systems grow in complexity, fine-tuning architectural parameters across multiple sub-systems (e.g., datapath, memory blocks in different hierarchies, interconnects, compiler optimization, etc.) quickly results in a combinatorial explosion of the design space. This makes domain-specific customization an extremely challenging task. Prior work explores using reinforcement learning (RL) and other optimization methods to automatically explore the large design space. However, these methods have traditionally relied on single-agent RL/ML formulations. It is unclear how scalable single-agent formulations are as the complexity of the design space increases (e.g., full-stack System-on-Chip design). Therefore, we propose an alternative formulation that leverages Multi-Agent RL (MARL) to tackle this problem. The key idea behind using MARL is the observation that parameters across different sub-systems are more or less independent, thus allowing a decentralized role to be assigned to each agent. We test this hypothesis by designing a domain-specific DRAM memory controller for several workload traces. Our evaluation shows that the MARL formulation consistently outperforms single-agent RL baselines such as Proximal Policy Optimization and Soft Actor-Critic across different target objectives such as low power and latency. To this end, this work opens a pathway for new and promising research on MARL solutions for hardware architecture search.
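To make the decentralized formulation concrete, here is a toy sketch in which each agent owns one memory-controller parameter and all agents share a single scalarized power/latency reward. The parameter names, value ranges, and the `simulate_dram` hook are illustrative placeholders rather than the paper's actual setup, and simple epsilon-greedy learners stand in for PPO/SAC agents.

```python
# Toy sketch of the decentralized MARL formulation: one agent per parameter,
# shared reward. All names and value ranges below are assumptions.
import random

PARAM_SPACES = {
    "page_policy":    ["open", "closed", "adaptive"],
    "scheduler":      ["fcfs", "frfcfs", "frfcfs_cap"],
    "request_buffer": [8, 16, 32, 64],
    "refresh_policy": ["per_bank", "all_bank"],
}

def simulate_dram(config, trace):
    """Stand-in for a trace-driven DRAM simulator; returns dummy (power, latency).
    A real setup would invoke a cycle-accurate simulator on the workload trace."""
    return random.uniform(1.0, 2.0), random.uniform(50.0, 150.0)

def shared_reward(power, latency, w_power=0.5, w_latency=0.5):
    # All agents optimize the same scalarized objective (higher is better).
    return -(w_power * power + w_latency * latency)

class IndependentAgent:
    """Epsilon-greedy learner over one parameter, standing in for a PPO/SAC agent."""
    def __init__(self, choices, eps=0.1):
        self.choices, self.eps = choices, eps
        self.values = {c: 0.0 for c in choices}
        self.counts = {c: 0 for c in choices}

    def act(self):
        if random.random() < self.eps:
            return random.choice(self.choices)
        return max(self.choices, key=lambda c: self.values[c])

    def update(self, choice, reward):
        self.counts[choice] += 1
        self.values[choice] += (reward - self.values[choice]) / self.counts[choice]

def search(trace, episodes=200):
    agents = {name: IndependentAgent(space) for name, space in PARAM_SPACES.items()}
    best = (float("-inf"), None)
    for _ in range(episodes):
        # Decentralized action selection: each agent picks only its own parameter.
        config = {name: agent.act() for name, agent in agents.items()}
        power, latency = simulate_dram(config, trace)
        r = shared_reward(power, latency)
        for name, agent in agents.items():
            agent.update(config[name], r)  # shared reward, local credit assignment
        if r > best[0]:
            best = (r, config)
    return best
```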
We introduce DeepNash, an autonomous agent capable of learning to play the imperfect-information game Stratego from scratch, up to a human expert level. Stratego is one of the few iconic board games that artificial intelligence (AI) has not yet mastered. This popular game has an enormous game tree of $10^{535}$ nodes, i.e., $10^{175}$ times larger than that of Go. It has the additional complexity of requiring decision-making under imperfect information, similar to Texas hold'em poker, which has a significantly smaller game tree (on the order of $10^{164}$ nodes). Decisions in Stratego are made over a large number of discrete actions with no evident link between action and outcome. Episodes are long, with often hundreds of moves before a player wins, and situations in Stratego cannot easily be broken down into manageably sized sub-problems as in poker. For these reasons, Stratego has been a grand challenge for the field of AI for decades, and existing AI methods barely reach an amateur level of play. DeepNash uses a game-theoretic, model-free deep reinforcement learning method, without search, that learns to master Stratego via self-play. Regularised Nash Dynamics (R-NaD), a key component of DeepNash, converges to an approximate Nash equilibrium, instead of "cycling" around it, by directly modifying the underlying multi-agent learning dynamics. DeepNash beat existing state-of-the-art AI methods in Stratego and achieved a yearly (2022) and all-time top-3 rank on the Gravon games platform, competing with human expert players.
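The flavour of reward-regularized dynamics can be illustrated on a toy two-player zero-sum matrix game: the sketch below applies an entropic mirror-ascent update with a log-ratio penalty toward a periodically refreshed reference policy, loosely mirroring the role of R-NaD's regularization. It is a didactic approximation, not DeepNash's R-NaD pipeline.

```python
# Toy sketch of reward-regularized self-play dynamics on rock-paper-scissors.
# Illustrative only: a multiplicative-weights update with a log-ratio penalty
# toward a periodically refreshed reference policy.
import numpy as np

G = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])  # row player's payoff; column player receives -G

def regularized_dynamics(G, eta=0.2, lr=0.1, steps=5000, refresh=500):
    n = G.shape[0]
    pi_x = np.ones(n) / n                      # row player policy
    pi_y = np.ones(n) / n                      # column player policy
    ref_x, ref_y = pi_x.copy(), pi_y.copy()    # reference policies
    for t in range(steps):
        # Regularized payoffs: expected payoff minus a log-ratio penalty
        # pulling each policy toward its current reference policy.
        qx = G @ pi_y - eta * np.log(pi_x / ref_x)
        qy = -G.T @ pi_x - eta * np.log(pi_y / ref_y)
        # Entropic mirror-ascent (multiplicative-weights) update.
        pi_x = pi_x * np.exp(lr * qx); pi_x /= pi_x.sum()
        pi_y = pi_y * np.exp(lr * qy); pi_y /= pi_y.sum()
        # Periodically reset the reference policies to the current policies;
        # this is the mechanism intended to damp cycling around the equilibrium.
        if (t + 1) % refresh == 0:
            ref_x, ref_y = pi_x.copy(), pi_y.copy()
    return pi_x, pi_y

if __name__ == "__main__":
    # Both policies should approach uniform, the Nash equilibrium of this game.
    print(regularized_dynamics(G))
```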
Each year, expert-level performance is attained in increasingly complex multi-agent domains, with notable examples including Go, poker, and StarCraft II. This rapid progress is accompanied by a corresponding need to better understand how such agents attain this performance, to enable their safe deployment, identify limitations, and reveal potential means of improving them. In this paper we step back from performance-focused multi-agent learning and instead turn our attention to agent behaviour analysis. We introduce a model-agnostic method for discovering behavioural clusters in multi-agent domains, using variational inference to learn a hierarchy of behaviours at the joint and local agent levels. Our framework makes no assumptions about agents' underlying learning algorithms, does not require access to their latent states or models, and can be trained entirely from offline observational data. We illustrate the effectiveness of our method for coupled understanding of behaviours at the joint and local agent levels, detection of behavioural changepoints throughout training, and discovery of core behavioural concepts (e.g., those that facilitate higher returns), and we demonstrate the approach's scalability to a high-dimensional multi-agent MuJoCo control domain.
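As a highly simplified illustration of variational behaviour-cluster discovery from offline data, the sketch below fits a single-level categorical latent over fixed-length observation windows with a Gumbel-softmax relaxation. The module names, dimensions, and single-level structure are assumptions and omit the joint/local hierarchy of the actual method.

```python
# Minimal sketch of discrete behaviour-cluster discovery with variational
# inference over offline observation windows (illustrative assumptions only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BehaviourClusterVAE(nn.Module):
    def __init__(self, obs_dim, window, n_clusters, hidden=128):
        super().__init__()
        in_dim = obs_dim * window
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_clusters))      # logits over behaviour clusters
        self.decoder = nn.Sequential(
            nn.Linear(n_clusters, hidden), nn.ReLU(),
            nn.Linear(hidden, in_dim))          # reconstructs the window
        self.n_clusters = n_clusters

    def forward(self, windows, tau=1.0):
        x = windows.flatten(1)                  # (batch, window * obs_dim)
        logits = self.encoder(x)
        z = F.gumbel_softmax(logits, tau=tau)   # relaxed one-hot cluster sample
        recon = self.decoder(z)
        # ELBO terms: reconstruction + KL(q(z|x) || uniform categorical prior).
        recon_loss = F.mse_loss(recon, x)
        q = F.softmax(logits, dim=-1)
        log_k = torch.log(torch.tensor(float(self.n_clusters)))
        kl = (q * (q.clamp_min(1e-8).log() + log_k)).sum(-1).mean()
        return recon_loss + kl, logits.argmax(-1)   # loss and hard cluster ids

# Usage: slice offline trajectories into windows and fit with any optimizer.
model = BehaviourClusterVAE(obs_dim=8, window=10, n_clusters=6)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
windows = torch.randn(256, 10, 8)               # stand-in for offline data
loss, clusters = model(windows)
loss.backward(); opt.step()
```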